test size
Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Sreeram, Shiva, Maalouf, Alaa, Sharma, Pratyusha, Rus, Daniela
Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
Enhancing Classification with Semi-Supervised Deep Learning Using Distance-Based Sample Weights
Abedinia, Aydin, Tabakhi, Shima, Seydi, Vahid
Recent advancements in semi-supervised deep learning have introduced effective strategies for leveraging both labeled and unlabeled data to improve classification performance. This work proposes a semi-supervised framework that utilizes a distance-based weighting mechanism to prioritize critical training samples based on their proximity to test data. By focusing on the most informative examples, the method enhances model generalization and robustness, particularly in challenging scenarios with noisy or imbalanced datasets. Building on techniques such as uncertainty consistency and graph-based representations, the approach addresses key challenges of limited labeled data while maintaining scalability. Experiments on twelve benchmark datasets demonstrate significant improvements across key metrics, including accuracy, precision, and recall, consistently outperforming existing methods. This framework provides a robust and practical solution for semi-supervised learning, with potential applications in domains such as healthcare and security where data limitations pose significant challenges.
A New Flexible Train-Test Split Algorithm, an approach for choosing among the Hold-out, K-fold cross-validation, and Hold-out iteration
Bami, Zahra, Behnampour, Ali, Doosti, Hassan
Artificial Intelligent transformed industries, like engineering, medicine, finance. Predictive models use supervised learning, a vital Machine learning subset. Crucial for model evaluation, cross-validation includes re-substitution, hold-out, and K-fold. This study focuses on improving the accuracy of ML algorithms across three different datasets. To evaluate Hold-out, Hold-out with iteration, and K-fold Cross-Validation techniques, we created a flexible Python program. By modifying parameters like test size, Random State, and 'k' values, we were able to improve accuracy assessment. The outcomes demonstrate the Hold-out validation method's persistent superiority, particularly with a test size of 10%. With iterations and Random State settings, hold-out with iteration shows little accuracy variance. It suggests that there are variances according to algorithm, with Decision Tree doing best for Framingham and Naive Bayes and K Nearest Neighbors for COVID-19. Different datasets require different optimal K values in K-Fold Cross Validation, highlighting these considerations. This study challenges the universality of K values in K-Fold Cross Validation and suggests a 10% test size and 90% training size for better outcomes. It also emphasizes the contextual impact of dataset features, sample size, feature count, and selected methodologies. Researchers can adapt these codes for their dataset to obtain highest accuracy with specific evaluation.
Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
Rutowski, Tomek, Harati, Amir, Shriberg, Elizabeth, Lu, Yang, Chlebek, Piotr, Oliveira, Ricardo
Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
Information Theory Inspired Pattern Analysis for Time-series Data
Huang, Yushan, Zhao, Yuchen, Capstick, Alexander, Palermo, Francesca, Haddadi, Hamed, Barnaghi, Payam
Current methods for pattern analysis in time series mainly rely on statistical features or probabilistic learning and inference methods to identify patterns and trends in the data. Such methods do not generalize well when applied to multivariate, multi-source, state-varying, and noisy time-series data. To address these issues, we propose a highly generalizable method that uses information theory-based features to identify and learn from patterns in multivariate time-series data. To demonstrate the proposed approach, we analyze pattern changes in human activity data. For applications with stochastic state transitions, features are developed based on Shannon's entropy of Markov chains, entropy rates of Markov chains, entropy production of Markov chains, and von Neumann entropy of Markov chains. For applications where state modeling is not applicable, we utilize five entropy variants, including approximate entropy, increment entropy, dispersion entropy, phase entropy, and slope entropy. The results show the proposed information theory-based features improve the recall rate, F1 score, and accuracy on average by up to 23.01% compared with the baseline models and a simpler model structure, with an average reduction of 18.75 times in the number of model parameters.
Accurate Prediction of Global Mean Temperature through Data Transformation Techniques
Niyogi, Debdarsan, Srinivasan, J.
It is important to predict how the Global Mean Temperature (GMT) will evolve in the next few decades. The ability to predict historical data is a necessary first step toward the actual goal of making long-range forecasts. This paper examines the advantage of statistical and simpler Machine Learning (ML) methods instead of directly using complex ML algorithms and Deep Learning Neural Networks (DNN). Often neglected data transformation methods prior to applying different algorithms have been used as a means of improving predictive accuracy. The GMT time series is treated both as a univariate time series and also cast as a regression problem. Some steps of data transformations were found to be effective. Various simple ML methods did as well or better than the more well-known ones showing merit in trying a large bouquet of algorithms as a first step. Fifty-six algorithms were subject to Box-Cox, Yeo-Johnson, and first-order differencing and compared with the absence of them. Predictions for the annual GMT testing data were better than that published so far, with the lowest RMSE value of 0.02 $^\circ$C. RMSE for five-year mean GMT values for the test data ranged from 0.00002 to 0.00036 $^\circ$C.
Algebraic tests of general Gaussian latent tree models
We consider general Gaussian latent tree models in which the observed variables are not restricted to be leaves of the tree. Extending related recent work, we give a full semi-algebraic description of the set of covariance matrices of any such model. In other words, we find polynomial constraints that characterize when a matrix is the covariance matrix of a distribution in a given latent tree model. However, leveraging these constraints to test a given such model is often complicated by the number of constraints being large and by singularities of individual polynomials, which may invalidate standard approximations to relevant probability distributions. Illustrating with the star tree, we propose a new testing methodology that circumvents singularity issues by trading off some statistical estimation efficiency and handles cases with many constraints through recent advances on Gaussian approximation for maxima of sums of high-dimensional random vectors. Our test avoids the need to maximize the possibly multimodal likelihood function of such models and is applicable to models with larger number of variables. These points are illustrated in numerical experiments.
Algebraic tests of general Gaussian latent tree models
We consider general Gaussian latent tree models in which the observed variables are not restricted to be leaves of the tree. Extending related recent work, we give a full semi-algebraic description of the set of covariance matrices of any such model. In other words, we find polynomial constraints that characterize when a matrix is the covariance matrix of a distribution in a given latent tree model. However, leveraging these constraints to test a given such model is often complicated by the number of constraints being large and by singularities of individual polynomials, which may invalidate standard approximations to relevant probability distributions. Illustrating with the star tree, we propose a new testing methodology that circumvents singularity issues by trading off some statistical estimation efficiency and handles cases with many constraints through recent advances on Gaussian approximation for maxima of sums of high-dimensional random vectors. Our test avoids the need to maximize the possibly multimodal likelihood function of such models and is applicable to models with larger number of variables. These points are illustrated in numerical experiments.
A Hybrid Q-Learning Sine-Cosine-based Strategy for Addressing the Combinatorial Test Suite Minimization Problem
Zamli, Kamal Z., Din, Fakhrud, Ahmed, Bestoun S., Bures, Miroslav
The sine-cosine algorithm (SCA) is a new population-based meta-heuristic algorithm. In addition to exploiting sine and cosine functions to perform local and global searches (hence the name sine-cosine), the SCA introduces several random and adaptive parameters to facilitate the search process. Although it shows promising results, the search process of the SCA is vulnerable to local minima/maxima due to the adoption of a fixed switch probability and the bounded magnitude of the sine and cosine functions (from -1 to 1). In this paper, we propose a new hybrid Q-learning sine-cosine- based strategy, called the Q-learning sine-cosine algorithm (QLSCA). Within the QLSCA, we eliminate the switching probability. Instead, we rely on the Q-learning algorithm (based on the penalty and reward mechanism) to dynamically identify the best operation during runtime. Additionally, we integrate two new operations (L\'evy flight motion and crossover) into the QLSCA to facilitate jumping out of local minima/maxima and enhance the solution diversity. To assess its performance, we adopt the QLSCA for the combinatorial test suite minimization problem. Experimental results reveal that the QLSCA is statistically superior with regard to test suite size reduction compared to recent state-of-the-art strategies, including the original SCA, the particle swarm test generator (PSTG), adaptive particle swarm optimization (APSO) and the cuckoo search strategy (CS) at the 95% confidence level. However, concerning the comparison with discrete particle swarm optimization (DPSO), there is no significant difference in performance at the 95% confidence level. On a positive note, the QLSCA statistically outperforms the DPSO in certain configurations at the 90% confidence level.